Gapminder is an excellent organization aimed at increasing the use and understanding of statistics on a number of global topics. They collect a variety of data from many sources and aim to produce fact-based statistics reflecting the current state of our world. In addition, Gapminder has developed easily-accessible tools for visualizing the data in creative and informative ways.
The data we will be exploring throughout this guide consists of population, life expectancy, and GDP information for many countries over time. If you would like to download this data yourself, click here. This data can also be pulled from the class GitHub repository.
In this document, our aims will be two-fold:
to gain some experience making visualizations with matplotlib, and
to illustrate tips and tricks in Jupyter notebooks rendered with Quarto.
Tip: Outside of the code chunks, we can use Markdown and LaTeX as usual. We can add bullet points, write in-line math equations (e.g., \(\sqrt{25} + \frac{1}{2}\)), longer math equations, etc.
We will begin by loading and cleaning the data. Then we will proceed to visualize the data in various ways.
Data
Let’s begin by loading and cleaning the data. To improve readability and modularity, I have written two external functions, load_gapminder_data and clean_gapminder_data, and import them from their file, data.py. Please open this file to see what load_gapminder_data and clean_gapminder_data are doing, and note the function documentation.
from data import load_gapminder_data, load_percept_data, \
    clean_gapminder_data, clean_probly_data, clean_numberly_data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

gapminder = clean_gapminder_data(load_gapminder_data())
gapminder
Table 1: Loaded gapminder data
|      | country     | year | population | continent | life_exp | gdp_per_cap |
|------|-------------|------|------------|-----------|----------|-------------|
| 0    | Afghanistan | 1952 | 8425333.0  | Asia      | 28.801   | 779.445314  |
| 1    | Afghanistan | 1957 | 9240934.0  | Asia      | 30.332   | 820.853030  |
| 2    | Afghanistan | 1962 | 10267083.0 | Asia      | 31.997   | 853.100710  |
| 3    | Afghanistan | 1967 | 11537966.0 | Asia      | 34.020   | 836.197138  |
| 4    | Afghanistan | 1972 | 13079460.0 | Asia      | 36.088   | 739.981106  |
| ...  | ...         | ...  | ...        | ...       | ...      | ...         |
| 1699 | Zimbabwe    | 1987 | 9216418.0  | Africa    | 62.351   | 706.157306  |
| 1700 | Zimbabwe    | 1992 | 10704340.0 | Africa    | 60.377   | 693.420786  |
| 1701 | Zimbabwe    | 1997 | 11404948.0 | Africa    | 46.809   | 792.449960  |
| 1702 | Zimbabwe    | 2002 | 11926563.0 | Africa    | 39.989   | 672.038623  |
| 1703 | Zimbabwe    | 2007 | 12311143.0 | Africa    | 43.487   | 469.709298  |

1704 rows × 6 columns
Fortunately, the data was already very clean, so we did not conduct any major modifications to the data. When you do need to perform data cleaning, think carefully about the choices you make in the data cleaning stage. Be sure to document how you cleaned the data and why you made those choices.
Visualizing the gapminder data (matplotlib)
Next, we put our visualization skills to the test and create different plots with matplotlib.
First, we are interested in exploring life expectancy as a function of GDP. We create a scatterplot of life expectancy versus GDP per capita for the year 2007 using matplotlib, where the size of each point is based on the country's population and the color indicates the continent in which the country is located.
import matplotlib.pyplot as plt

# use tex
plt.rc('text', usetex=True)

# Create a scatter plot of life expectancy vs. GDP per capita in 2007, colored by continent
gapminder_2007 = gapminder[gapminder['year'] == 2007]
continents = gapminder_2007['continent'].unique()
colors = ['blue', 'green', 'red', 'purple', 'orange']
max_pop = max(gapminder_2007['population'])
for i, continent in enumerate(continents):
    subset = gapminder_2007[gapminder_2007['continent'] == continent]
    plt.scatter(subset['gdp_per_cap'], subset['life_exp'], color=colors[i],
                s=200 * subset['population'] / max_pop, label=continent)
legend = plt.legend()
for handle in legend.legend_handles:
    handle.set_sizes([40])
plt.title('Life expectancy vs. GDP per capita in 2007')
plt.xlabel('GDP per capita')
plt.ylabel('Life expectancy')
plt.xscale('log')
plt.show()
Figure 1: Scatter plot of life expectancy vs. GDP per capita in 2007.
This was extremely painful to do in matplotlib. How about using seaborn?
sns.scatterplot(data=gapminder_2007, x='gdp_per_cap', y='life_exp',
                hue='continent', size='population',
                hue_order=continents, palette=colors, sizes=(0, 500))
plt.title('Life expectancy vs. GDP per capita in 2007')
plt.xlabel('GDP per capita')
plt.ylabel('Life expectancy')
plt.xscale('log')
plt.show()
Figure 2: Same figure as above, but done in seaborn
Life expectancy appears to rise rapidly with GDP per capita in the low-GDP range and more gradually in the high-GDP range. Several African countries have surprisingly low life expectancy for their GDP per capita.
Next, we explore change in life expectancy over time. For each continent, we’ll use matplotlib to create a series of boxplots over time, where each data point corresponds to the life expectancy of a country for the given year in the given continent.
fig, axs = plt.subplots(5, 1, figsize=(8, 16))
for i, continent in enumerate(continents):
    subset = gapminder[gapminder['continent'] == continent]
    subset.boxplot('life_exp', 'year', ax=axs[i])
    axs[i].set_title(continent)
    axs[i].set_xlabel('Year')
    axs[i].set_ylabel('Life expectancy')
# add some space between subplots
plt.tight_layout()
# make axes have same ymin and ymax
for ax in axs:
    ax.set_ylim(20, 90)
Figure 3: Boxplot of life expectancy over time
Again, this was a pain to do in matplotlib. Here’s seaborn:
g = sns.catplot(data=gapminder, x='year', y='life_exp', col='continent', kind='box',
                col_wrap=1, height=3, aspect=2, fill=False)
for ax in g.axes:
    ax.set_ylabel('Life expectancy')
    ax.set_xlabel('Year')
    ax.tick_params(labelbottom=True)
    ax.set_title(ax.get_title().split('=')[1])
    ax.spines[['top', 'right']].set_visible(True)
g.set(title='Life expectancy over time by continent')
plt.tight_layout()
Figure 4: Same figure as above, but done in seaborn
We see that life expectancy in Africa increased from the 1950s until the 1990s but has since stayed fairly constant, with a median of around 50 years. The Americas, Asia, and Europe, on the other hand, have experienced continued growth. Oceania shows a very narrow spread because it includes only a few countries.
Tip: We can change the size and shape of the figure by modifying figsize.
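As a minimal sketch of this tip (the numbers below are made up purely for illustration), figsize takes a (width, height) tuple in inches, and an existing figure can also be resized afterwards:

```python
import matplotlib.pyplot as plt

# figsize is (width, height) in inches; these values are illustrative only
fig_wide, ax_wide = plt.subplots(figsize=(10, 3))  # short and wide
fig_tall, ax_tall = plt.subplots(figsize=(4, 8))   # tall and narrow

# Toy data, not taken from gapminder
years = [1952, 1957, 1962, 1967]
values = [30.0, 32.5, 35.1, 37.4]
ax_wide.plot(years, values)
ax_tall.plot(years, values)

# An existing figure can be resized after creation
fig_tall.set_size_inches(6, 6)
```

The stacked boxplots above use a tall figure (8 by 16 inches) for the same reason: five panels in one column need more height than width.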
Now, compute the mean and variance of the GDP for each continent without using groupby.
# compute mean and variance of GDP per capita for each continent without using groupby
import numpy as np

for continent in continents:
    subset = gapminder[gapminder['continent'] == continent]
    print(continent)
    print('Mean GDP:', int(np.round(np.mean(subset['gdp_per_cap']), 0)))
    print('Variance GDP:', int(np.round(np.var(subset['gdp_per_cap']))))
Asia
Mean GDP: 7902
Variance GDP: 196774343
Europe
Mean GDP: 14469
Variance GDP: 87276908
Africa
Mean GDP: 2194
Variance GDP: 7984371
Americas
Mean GDP: 7136
Variance GDP: 40782196
Oceania
Mean GDP: 18622
Variance GDP: 38751808
Now, do the same using groupby.
# do the same using groupby
result = gapminder.groupby('continent')['gdp_per_cap'].agg(["mean", "var"])
result
Table 2: Mean and variance of GDP per capita by continent
| continent | mean         | var          |
|-----------|--------------|--------------|
| Africa    | 2193.754578  | 7.997187e+06 |
| Americas  | 7136.110356  | 4.091859e+07 |
| Asia      | 7902.150428  | 1.972725e+08 |
| Europe    | 14469.475533 | 8.752002e+07 |
| Oceania   | 18621.609223 | 4.043667e+07 |
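Note that the variances in Table 2 are slightly larger than those printed by the loop earlier. This is expected: np.var defaults to the population variance (ddof=0, dividing by n), whereas the pandas var aggregation defaults to the sample variance (ddof=1, dividing by n - 1). A small sketch with made-up numbers (not the gapminder data) shows the difference and one way to reconcile the two:

```python
import numpy as np
import pandas as pd

# Toy values for illustration only
toy = pd.DataFrame({
    'continent': ['A', 'A', 'A', 'B', 'B', 'B'],
    'gdp_per_cap': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
grouped = toy.groupby('continent')['gdp_per_cap']

# pandas default: sample variance (divides by n - 1)
var_pandas = grouped.var()

# numpy default: population variance (divides by n)
var_numpy = grouped.agg(lambda x: np.var(x.to_numpy()))

# Passing ddof=1 to numpy reproduces the pandas result
var_numpy_ddof1 = grouped.agg(lambda x: np.var(x.to_numpy(), ddof=1))

print(pd.DataFrame({
    'pandas': var_pandas,
    'numpy': var_numpy,
    'numpy_ddof1': var_numpy_ddof1,
}))
```

Either convention is fine as long as it is stated; to match the loop output with groupby, one could pass ddof=0 to the pandas var method instead.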
# render the above as a latex table
print(result.to_latex(float_format='%.0f'))
\begin{tabular}{lrr}
\toprule
& mean & var \\
continent & & \\
\midrule
Africa & 2194 & 7997187 \\
Americas & 7136 & 40918591 \\
Asia & 7902 & 197272506 \\
Europe & 14469 & 87520020 \\
Oceania & 18622 & 40436669 \\
\bottomrule
\end{tabular}
Next, we want to ask about raw GDP (i.e. overall GDP for each country, rather than standardized by per capita). Let’s create a table that shows the average GDP for each continent in 2007, as well as the number of countries in the continent, and the standard deviation of the GDP.
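One possible way to build such a table is sketched below. It assumes raw GDP can be approximated as gdp_per_cap times population, and it uses a small made-up data frame with the same column names as gapminder so that the snippet runs on its own. The column name gdp and the output column labels are our own choices, not part of the original data.

```python
import pandas as pd

# Made-up rows that mimic the gapminder columns (illustration only)
toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D', 'A', 'B'],
    'continent': ['X', 'X', 'Y', 'Y', 'X', 'X'],
    'year': [2007, 2007, 2007, 2007, 2002, 2002],
    'population': [1_000_000, 3_000_000, 2_000_000, 2_000_000, 900_000, 2_800_000],
    'gdp_per_cap': [1000.0, 2000.0, 5000.0, 7000.0, 900.0, 1900.0],
})

# Keep 2007 only and compute raw GDP as per-capita GDP times population
toy_2007 = toy[toy['year'] == 2007].copy()
toy_2007['gdp'] = toy_2007['gdp_per_cap'] * toy_2007['population']

# One row per continent: number of countries, mean GDP, and standard deviation
gdp_summary = toy_2007.groupby('continent').agg(
    n_countries=('country', 'nunique'),
    mean_gdp=('gdp', 'mean'),
    sd_gdp=('gdp', 'std'),
)
print(gdp_summary)
```

With the real data, replacing toy with gapminder should give the corresponding table for all five continents.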
Now, we will:
- play around with different themes and color schemes
- learn about a few useful visualization tools
To guide you through this material, I will provide examples using the gapminder and perceptions data. Put briefly, the perceptions data deals with the perceptions of different words relating to probabilities and numbers. The raw data came from /r/samplesize responses to the following question: What [probability/number] would you assign to the phrase “[phrase]”? You can read more about the perceptions data at (https://github.com/zonination/perceptions).
Themes and Color Schemes
As I mentioned in the slides, it can become monotonous to look at 100+ plots with the same default matplotlib or seaborn or ggplot color scheme. A simple way to mix things up is to apply a different built-in matplotlib theme. Pick your favorite here or simply google “custom matplotlib themes” for a plethora of options. Keep in mind the data visualization guidelines we’ve gone over.
Let’s play around with the themes of our scatterplot above:
def plot_gapminder(gapminder, title):
    gapminder_2007 = gapminder[gapminder['year'] == 2007]
    plt.figure(figsize=(8, 6))
    g = sns.scatterplot(data=gapminder_2007, x='gdp_per_cap', y='life_exp',
                        hue='continent', size='population',
                        hue_order=continents, palette=colors, sizes=(0, 500))
    # remove frame from legend
    g.legend(frameon=False)
    plt.title(title)
    plt.xlabel('GDP per capita')
    plt.ylabel('Life expectancy')
    plt.xscale('log')
    plt.show()

sns.set_theme(style="darkgrid")
plot_gapminder(gapminder, 'Default seaborn theme')

sns.set_theme(style="white")
plot_gapminder(gapminder, 'A nicer theme')

plt.style.use('default')
plot_gapminder(gapminder, 'Default matplotlib theme')

plt.rcParams['text.usetex'] = True
plt.rcParams['xtick.direction'] = 'in'
plt.rcParams['ytick.direction'] = 'in'
plt.rcParams['xtick.top'] = True
plt.rcParams['ytick.right'] = True
plot_gapminder(gapminder, 'The nicest theme??')
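Figure 5: The 2007 scatterplot under four themes, in the order plotted: (a) 'Default seaborn theme', (b) 'A nicer theme', (c) 'Default matplotlib theme', (d) 'The nicest theme??'.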
Now, we can compute the mean probability associated with each phrase in the probly data and plot the results in a bar graph.
plt.figure(figsize=(10, 6))
mean_prob = probly.groupby('phrase', observed=False)['prob'].mean().reset_index()
sns.barplot(x='phrase', y='prob', data=mean_prob, color='black')
plt.xticks(rotation=45, ha='right')
plt.xlabel("Phrase")
plt.ylabel("Mean Probability")
plt.title("Mean Probability by Phrase")
plt.show()
Figure 6: Bar plot of mean probability by phrase
Choosing an appropriate color scheme for your plots can drastically improve the readability of your plots. Sometimes, it is worthwhile to stray from the default colors in your visualization library. The viridis color scheme in particular is very nice for continuous data. If you’re incredibly ambitious, you can even create your own color palettes.
# Color Schemes
# Continuous color schemes
plt.figure(figsize=(10, 6))
base_plt = sns.scatterplot(x='life_exp', y='gdp_per_cap', hue='year', data=gapminder)
plt.yscale('log')
plt.title("Default Color Scheme (Continuous)")
plt.show()
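For comparison with the default scheme, the sketch below applies the viridis colormap to a continuous variable. It uses synthetic data rather than gapminder so that it runs on its own, and it calls matplotlib directly; with seaborn, passing palette='viridis' to sns.scatterplot should have a similar effect.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data for illustration only (not the gapminder values)
rng = np.random.default_rng(0)
year = rng.integers(1952, 2008, size=200)
life_exp = 40 + 0.5 * (year - 1952) + rng.normal(0, 5, size=200)
gdp_per_cap = np.exp(6 + 0.05 * (year - 1952) + rng.normal(0, 0.5, size=200))

fig, ax = plt.subplots(figsize=(10, 6))
# c= maps the continuous variable to color; cmap= selects the colormap
points = ax.scatter(life_exp, gdp_per_cap, c=year, cmap='viridis')
ax.set_yscale('log')
ax.set_xlabel('Life expectancy')
ax.set_ylabel('GDP per capita')
ax.set_title('Viridis Color Scheme (Continuous)')
# A colorbar is the natural legend for a continuous color scale
cbar = fig.colorbar(points, ax=ax, label='Year')
```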
Rather than scatterplots, another type of graph that can often be very informative is a heatmap. In Figure 11, we plot a heatmap of the life expectancy across time for various countries.
Figure 11: Heatmap of life expectancy by country and year.
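The code for this figure is not shown here. A minimal sketch of how such a heatmap could be produced is given below; it assumes a long-format data frame with country, year, and life_exp columns, and uses a few made-up rows rather than the full gapminder data.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up long-format data with gapminder-style column names (illustration only)
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'year': [1997, 2002, 2007, 1997, 2002, 2007],
    'life_exp': [50.0, 52.0, 55.0, 70.0, 72.0, 73.0],
})

# Reshape to wide format: one row per country, one column per year
life_wide = toy.pivot(index='country', columns='year', values='life_exp')

fig, ax = plt.subplots(figsize=(6, 3))
sns.heatmap(life_wide, cmap='viridis', cbar_kws={'label': 'Life expectancy'}, ax=ax)
ax.set_title('Life expectancy by country and year')
```

With many countries, it may help to plot only a subset or to increase the figure height so that the row labels remain legible.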
We can make a similar plot using the probly data.
#| label: fig-heatmap2
#| fig-cap: Heatmap of probability by phrase and sample
# Heatmaps
probly_wide = probly.pivot(index='phrase', columns='id', values='prob')
plt.figure(figsize=(10, 8))
sns.heatmap(probly_wide, cmap="viridis", cbar_kws={'label': 'Probability'})
plt.title("Probability assignments")
plt.show()
It may be informative to look at multiple pair-wise relationships in the data in a single plot. The pairplot function of seaborn lets us do this, allowing us to plot the pairwise relationships between many different variables at once.
g = sns.pairplot(data=gapminder, vars=['population', 'gdp_per_cap', 'life_exp'], hue='continent')
for axs in g.axes:
    for ax in axs:
        ax.spines[['top', 'right']].set_visible(True)
plt.suptitle("Pair Plot", y=1.02)
plt.show()
Figure 12: Pair plot of ‘population’, ‘gdp_per_cap’, ‘life_exp’ and colored by ‘continent’
We can also make “ridgeline” plots, as in this example:
plt.figure(figsize=(10, 6))
sns.kdeplot(data=numberly, x='number', hue='phrase', fill=True, common_norm=False)
plt.xlim(0, 50)
plt.title("Perception of Numbers")
plt.show()
Figure 13: Ridgeline Plots of Perception of Numbers
Sometimes, it may be useful to organize multiple plots side-by-side. Here is one way to do this in matplotlib.
# Side-by-side Plots
plt.figure(figsize=(15, 6))

plt.subplot(1, 2, 1)
sns.scatterplot(data=gapminder, x='population', y='life_exp', hue='continent')
plt.xscale('log')
plt.title("Population vs Life Expectancy")

plt.subplot(1, 2, 2)
sns.scatterplot(data=gapminder, x='gdp_per_cap', y='life_exp', hue='continent')
plt.xscale('log')
plt.title("GDP Per Capita vs Life Expectancy")

plt.tight_layout()
plt.show()
Figure 14: Side-by-side plots of Population vs Life Expectancy and GDP Per Capita vs Life Expectancy
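Another way to arrange panels, sketched here under the assumption that both panels share the same y variable, is matplotlib's object-oriented interface with plt.subplots. The data below are synthetic and only stand in for the gapminder columns.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data for illustration only
rng = np.random.default_rng(1)
life_exp = rng.uniform(40, 80, size=100)
population = np.exp(rng.normal(15, 1.5, size=100))
gdp_per_cap = np.exp(rng.normal(8, 1, size=100))

# One row, two columns; sharey=True keeps the life expectancy axis aligned
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(15, 6), sharey=True)

ax_left.scatter(population, life_exp)
ax_left.set_xscale('log')
ax_left.set_xlabel('Population')
ax_left.set_ylabel('Life expectancy')
ax_left.set_title('Population vs Life Expectancy')

ax_right.scatter(gdp_per_cap, life_exp)
ax_right.set_xscale('log')
ax_right.set_xlabel('GDP per capita')
ax_right.set_title('GDP Per Capita vs Life Expectancy')

fig.tight_layout()
```

Holding explicit axes objects makes it straightforward to pass ax=ax_left or ax=ax_right to seaborn functions such as sns.scatterplot, and a shared y axis makes the two panels directly comparable.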